Sets Up Autotesting on SNL Machines #426

mjschmidt271 · 2025-04-15T19:23:00Z

This adds the AT2 workflow to test on SNL machines.

Currently, the only testing we do with this workflow is for GPU/cuda on H100 cards.

This autotesting workflow is triggered by pull requests to main and by a handful of actions on an open PR to main (e.g., reopened, converted to ready from draft, etc.).

A couple of notes (as I understand things--could be mistaken in minor ways):

If a PR contains changes to the .github directory, a member of the snl-testing-admins team must add the CI-AT2_special_approval to kick off the autotesting.
Otherwise, any PR that is submitted by a member of the snl-testing team, and only contains commits from members of that team should automatically trigger this autotesting.
In the case that the PR is submitted by someone who is not a member of the snl-testing team or contains commits from someone outside of that team, an approving review by someone on the snl-testing team is required to trigger autotesting.

Given the above, I have not been able to test the latter 2 conditions because my changes are not yet merged into main

mjschmidt271 · 2025-04-15T19:24:23Z

@jaelynlitz - It probably makes sense to also remove the PNNL workflow with this PR, so let me know if that works for you

odiazib · 2025-04-15T19:28:16Z

.github/workflows/at2_snl.yml

@@ -0,0 +1,36 @@
+name: SNL-AT2


I wonder if we should name this 'AT' only?

The name of the software product that facilitates this is "AT2 (Autotester2)"--however, what we call it on our side doesn't make a bit of difference to me

jeff-cohere · 2025-04-15T19:53:52Z

Is this test failure an issue for this PR, or is it a known and unrelated thing?

The following tests FAILED:
	 10 - mam4_nucleate_ice_unit_tests (Failed)

jaelynlitz · 2025-04-15T20:43:11Z

.github/workflows/at2_gcc-cuda.yml

+        uses: actions/checkout@v4
+        with:
+          persist-credentials: false
+          show-progress: false
+          submodules: recursive
+      - name: Cloning Haero
+        uses: actions/checkout@v3


Can we use the same version of checkout across the actions? 😄

jaelynlitz · 2025-04-15T20:44:42Z

.github/workflows/at2_gcc-cuda.yml

@@ -0,0 +1,104 @@
+name: gcc-cuda


What versions of gcc and cuda are we using?

jaelynlitz · 2025-04-15T20:47:48Z

.github/workflows/at2_gcc-cuda.yml

+            -DNUM_VERTICAL_LEVELS=72 \
+            -DENABLE_COVERAGE=OFF \
+            -DENABLE_SKYWALKER=ON \
+            -DCMAKE_CUDA_ARCHITECTURES=90 \


Since this is a var above I'm inclined to reference the var over a hard coded value

jaelynlitz · 2025-04-15T20:49:22Z

@jaelynlitz - It probably makes sense to also remove the PNNL workflow with this PR, so let me know if that works for you

I can remove it in a subsequent PR so we don't delay yours being merged

jaelynlitz · 2025-04-15T20:53:22Z

One more thought @mjschmidt271 if it makes sense to add a README like we had for PNNL CI? Seems like this was less complicated though

mam4xxSNL · 2025-04-15T23:59:36Z

Is this test failure an issue for this PR, or is it a known and unrelated thing?
The following tests FAILED:
	 10 - mam4_nucleate_ice_unit_tests (Failed)

~~@odiazib @singhbalwinder It looks to me like it may be addressed by currently open PRs--is that correct?~~

Well, it appears to have been a one-off and is now passing 🧐

codecov · 2025-04-16T00:21:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.68%. Comparing base (429f0a8) to head (4883778).
Report is 9 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #426      +/-   ##
==========================================
- Coverage   95.69%   95.68%   -0.01%     
==========================================
  Files          46       46              
  Lines        9446     9438       -8     
==========================================
- Hits         9039     9031       -8     
  Misses        407      407

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jeff-cohere · 2025-04-16T00:48:29Z

Is this test failure an issue for this PR, or is it a known and unrelated thing?
The following tests FAILED:
	 10 - mam4_nucleate_ice_unit_tests (Failed)
~~@odiazib @singhbalwinder It looks to me like it may be addressed by currently open PRs--is that correct?~~

Well, it appears to have been a one-off and is now passing 🧐

Hrmm! Well, here's what was reported, in case it rears its head again:

10: /home/runner/work/mam4xx/mam4xx/src/tests/mam4_nucleate_ice_unit_tests.cpp:203: FAILED:
10:   {Unknown expression after the reported line}
10: due to a fatal error condition:
10:   SIGFPE - Floating point error signal

singhbalwinder · 2025-04-16T04:52:52Z

Thanks a lot, Mike, for adding this testing pipeline! As @jaelynlitz mentioned, we can remove PNNL's pipeline in a separate PR. We might want to run both pipelines for a few PRs before removing PNNL's pipeline just to monitor both pipelines.

add at2 testing files

62d16ce

mjschmidt271 requested review from jaelynlitz, jeff-cohere, mam4xxSNL, odiazib, overfelt and singhbalwinder April 15, 2025 19:23

mjschmidt271 added the CI-AT2_special_approval label Apr 15, 2025

odiazib approved these changes Apr 15, 2025

View reviewed changes

mam4xxSNL approved these changes Apr 15, 2025

View reviewed changes

jaelynlitz reviewed Apr 15, 2025

View reviewed changes

respond to PR comments, add readme

44ad21a

--fixup=HEAD

e2d52b6

print ref in custom action

4883778

jeff-cohere approved these changes Apr 16, 2025

View reviewed changes

singhbalwinder approved these changes Apr 16, 2025

View reviewed changes

mam4xxSNL merged commit 6aca623 into main Apr 16, 2025
12 checks passed

mam4xxSNL deleted the mjs/snl-autotester branch April 16, 2025 16:54

Sets Up Autotesting on SNL Machines #426

Sets Up Autotesting on SNL Machines #426

Uh oh!

Conversation

mjschmidt271 commented Apr 15, 2025

Uh oh!

mjschmidt271 commented Apr 15, 2025

Uh oh!

odiazib Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

mjschmidt271 Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

jeff-cohere commented Apr 15, 2025

Uh oh!

jaelynlitz Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

mjschmidt271 Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

jaelynlitz Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

jaelynlitz Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

jaelynlitz commented Apr 15, 2025

Uh oh!

jaelynlitz commented Apr 15, 2025

Uh oh!

mam4xxSNL commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeff-cohere commented Apr 16, 2025

Uh oh!

singhbalwinder commented Apr 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

mam4xxSNL commented Apr 15, 2025 •

edited

Loading

codecov bot commented Apr 16, 2025 •

edited

Loading